The 8th Linguistic Annotation Workshop in conjunction with COLING 2014

Authors

Lori Levin

Manfred Stede

Abstract

Part-of-speech tagging (POS tagging) of spoken data requires different means of annotation than POS tagging of written and edited texts. To capture the features of spoken German, a distinct tagset is needed that covers the kinds of elements which occur only in speech. Creating such a coherent tagset requires analyzing the most prominent phenomena of spoken language, especially with respect to how they differ from written language. First evaluations have shown that the most prominent cause (over 50%) of errors in the existing automated POS tagging of transcripts of spoken German with the Stuttgart-Tübingen Tagset (STTS) and the TreeTagger was the inaccurate interpretation of speech particles. One reason for this is that this class of words is virtually absent from the current STTS. This paper proposes a recategorization of the STTS in the field of speech particles based on distributional factors rather than semantics. The ultimate aim is to create a comprehensive reference corpus of spoken German data for the global research community. It is imperative that all phenomena be reliably captured by future part-of-speech tag labels.

Similar resources

Empirical Analysis of Aggregation Methods for Collective Annotation

We investigate methods for aggregating the judgements of multiple individuals in a linguistic annotation task into a collective judgement. We define several aggregators that take the reliability of annotators into account and thus go beyond the commonly used majority vote, and we empirically analyse their performance on new datasets of crowdsourced data.

Use of Coreference in Automatic Searching for Multiword Discourse Markers in the Prague Dependency Treebank

The paper introduces a possibility of new research offered by a multi-dimensional annotation of the Prague Dependency Treebank. It focuses on exploitation of the annotation of coreference for the annotation of discourse relations expressed by multiword expressions. It tries to find which aspect interlinks these linguistic areas and how we can use this interplay in automatic searching for Czech ...

Interactive Annotation for Event Modality in Modern Standard and Egyptian Arabic Tweets

We present an interactive procedure to annotate a large-scale corpus of Modern Standard and Egyptian Arabic tweets for event modality that comprises obligation, permission, commitment, ability, and volition. The procedure splits up the annotation process into a series of simplified questions, dispenses with the requirement of expert linguistic knowledge, and captures nested modality triggers an...

Finding your "Inner-Annotator": An Experiment in Annotator Independence for Rating Discourse Coherence Quality in Essays

An experimental annotation method is described, showing promise for a subjective labeling task – discourse coherence quality of essays. Annotators developed personal protocols, reducing front-end resources: protocol development and annotator training. Substantial inter-annotator agreement was achieved for a 4-point scale. Correlational analyses revealed how unique linguistic phenomena were cons...

3arif: A Corpus of Modern Standard and Egyptian Arabic Tweets Annotated for Epistemic Modality Using Interactive Crowdsourcing

We present 3arif, a large-scale corpus of Modern Standard and Egyptian Arabic tweets annotated for epistemic modality. To create 3arif, we design an interactive crowdsourcing annotation procedure that splits up the annotation process into a series of simplified questions, dispenses with the requirement for expert linguistic knowledge and captures nested modality triggers and their attributes se...

Publication date: 2014
